Lab 2

Advanced Data Visualization

Author

Hannah Pawig

Instructions

Click to expand/collapse
# Package names
packages <- c("tidyverse", "here", "readxl", "scales", "RColorBrewer", "leaflet",
              "sf", "rnaturalearth", "countrycode", "plotly")


# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

Create a Quarto file for ALL Lab 2 (no separate files for Parts 1 and 2).

  • Make sure your final file is carefully formatted, so that each analysis is clear and concise.
  • Be sure your knitted .html file shows all your source code, including any function definitions.

Part One: Identifying Bad Visualizations

If you happen to be bored and looking for a sensible chuckle, you should check out these Bad Visualisations. Looking through these is also a good exercise in cataloging what makes a visualization good or bad.

Dissecting a Bad Visualization

Below is an example of a less-than-ideal visualization from the collection linked above. It comes to us from data provided for the Wellcome Global Monitor 2018 report by the Gallup World Poll:

Click to expand/collapse
knitr::include_graphics(here::here("image",
                                   "bad-wellcome-graph.jpg"), error = FALSE)

  1. While there are certainly issues with this image, do your best to tell the story of this graph in words. That is, what is this graph telling you? What do you think the authors meant to convey with it?

This graph shows the percentages of people within countries that believe in the safety of vaccines. This graph has sorted the countries by their global region and determines the ordering of the global region by using the region’s median percentage of people that believe in the safety of vaccines. The graph tries to convince the audience that certain global regions are more likely to believe that vaccines are safe, and also gives us information on individual countries’ percentages.

  1. List the variables that appear to be displayed in this visualization. Hint: Variables refer to columns in the data.
  • Country: The country in which the survey was conducted.
  • Region: The global region in which the country is located.
  • Percentage: The percentage of people in the country that believe in the safety of vaccines.
  1. Now that you’re versed in the grammar of graphics (e.g., ggplot), list the aesthetics used and which variables are mapped to each.

Aesthetics: - the percentage of people who believe that vaccines are safe is mapped to the x-axis. - the global region variable is mapped to the color aesthetic and the facet aesthetic.

  1. What type of graph would you call this? Meaning, what geom would you use to produce this plot?

I would call this a scatterplot and I would use geom_point() to produce this plot.

  1. Provide at least four problems or changes that would improve this graph. Please format your changes as bullet points!
  • I think the graph would look better and be easier to understand if we rotated it 90º (i.e. using coord_flip)
  • I feel like a box plot for each global region would be better here, if the main goal is to compare the median percentage of people who believe that vaccines are safe across global regions. Otherwise, I might consider using some kind of heat map that shows the percentage based on varying degrees of color intensity so that a person can visualize on a map people’s opinion of vaccine safety.
  • I would probably change the color palette, because the current one is not necessarily attractive. I’d make sure the new color palette is color-blind friendly.
  • I would not use Comic sans; I’d change it to another sans font. Comic sans is not necessarily fitting for this graph nor for scientific observations.
  • Stacking each global region’s section on top of each other is misleading. To my ey, it appears all the countries of Asia have higher reported percentages of thinking vaccines are safe compared to all countries in America. Again I think the graph would benefit if the percentage variable was on the y-axis instead of the x-axis. This would also allow an easier interpretation of percentages between regions.
  • The regions’ ordering is meaningless; I’d probably redefine the regions to make them more meaningful (i.e. based on the seven main world regions). I’m not quite sure why former soviet union was chosen to be a world region for this graph.

Improving the Bad Visualization

The data for the Wellcome Global Monitor 2018 report can be downloaded at the following site: https://wellcome.ac.uk/reports/wellcome-global-monitor/2018

There are two worksheets in the downloaded dataset file. You may need to read them in separately, but you may also just use one if it suffices.

Click to expand/collapse
wellcome_data <- read_excel(here::here("data",
                                               "wgm2018-dataset-crosstabs-all-countries.xlsx"),
  sheet = "Crosstabs all countries",
  skip = 2, 
  col_names = TRUE) 

wd <- wellcome_data |>
  janitor::clean_names() |> 
  select(
    country:response, column_n_percent_4
  ) |> 
  filter(
    response %in% c("Strongly agree", "Somewhat agree")
  ) |> 
  fill(question) |> 
  filter(
    question == "Q25 Do you strongly or somewhat agree, strongly or somewhat disagree or neither agree nor disagree with the following statement? Vaccines are safe."
  )

full_df <- read_excel(here('data',
                                   "wgm2018-dataset-crosstabs-all-countries.xlsx"),
                              sheet = "Full dataset",
                              skip = 0,
                              col_names = TRUE)

data_dict <- read_excel(here('data',
                                   "wgm2018-dataset-crosstabs-all-countries.xlsx"),
                              sheet = "Data dictionary",
                              skip = 0,
                              col_names = TRUE) |> 
  filter(
    `Variable Name` %in% c('WP5', 'Regions_Report')
  )

# Creating a tibble of 2 columns: country codes and country names
country_w_codes <- data_dict |> 
  filter(`Variable Name` == "WP5") |> 
  mutate(
    country_code = str_split(
      string = `Variable Type & Codes*`,
      pattern = ",",
      n = length(unique(full_df$WP5)), 
      simplify = TRUE
      )
    
  ) |> 
  janitor::clean_names() |> 
  # drop first few columns
  select(-variable_type_codes, -variable_name, -variable_type_codes,
         -variable_label_survey_question, -notes) |> 
  unlist() |> # turns list into a column
  as_tibble() |> 
  rename(country = value) |> 
  mutate(
    code = str_split(country, "=", n = 2, simplify = TRUE)[, 1],
    country = str_split(country, "=", n = 2, simplify = TRUE)[, 2],
    country = str_remove(country, ","),
    code = as.numeric(code)
  )

# create tibble with two columns: Region code and region name
regions_codes <- data_dict |> 
  filter(`Variable Name` == "Regions_Report") |> 
  mutate(
    country_code = str_split(`Variable Type & Codes*`, ",", n = length(unique(full_df$Regions_Report)), simplify = TRUE)
    
  ) |> 
  janitor::clean_names() |> 
  # drop first few columns
  select(-variable_type_codes, -variable_name, -variable_type_codes,
         -variable_label_survey_question, -notes) |> 
  unlist() |> 
  as_tibble() |> 
  rename(region = value) |> 
  mutate(
    code = str_split(region, "=", n = 2, simplify = TRUE)[, 1],
    region = str_split(region, "=", n = 2, simplify = TRUE)[, 2],
    region = str_remove(region, ","),
    code = as.numeric(code)
  )
Click to expand/collapse
former_soviet <- c(
  "Armenia", "Azerbaijan", "Belarus", "Estonia", "Georgia",
  "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania", "Moldova",
  "Tajikistan", "Turkmenistan", "Ukraine", "Uzbekistan", "Russia"
)
Click to expand/collapse
# create df with country and assigned region
country_region <- full_df |> 
  select(WP5, Regions_Report) |> 
  distinct() |> 
  left_join(country_w_codes, by = c("WP5" = "code")) |> 
  left_join(regions_codes, by = c("Regions_Report" = "code")) |> 
  select(country, region) |> 
 # replace republic of congo and palestine to match Crosstab country list
 mutate(
   country = case_when(
     str_detect(country, "Palestinian") ~ "Palestine",
     country == "Republic of Congo" ~ "Congo, Rep.",
     TRUE ~ country
   )
 )

# assign region to plotting data frame with a join
plot_df <- wd |>
  left_join(country_region, by = "country") |> 
  # create new regions
  mutate(
    continent = case_when(
      str_detect(region, "Asia") ~ "Asia",
      str_detect(region, "America") ~ "Americas",
      str_detect(region, "Europe") ~ "Europe",
      str_detect(region, "Africa") ~ "Africa",
      region == "Middle East" ~ "Middle East and North Africa",
      region == "Aus/NZ" ~ "Oceania",
      TRUE ~ "Not Assigned"
    )
  ) 
Click to expand/collapse
plot_df <- plot_df |>
  # calculate percentage of vaccine agree %s by country
  group_by(country) |>
  mutate(
    percentage = sum(column_n_percent_4, na.rm = TRUE)
  ) |>
  ungroup() |>

  # calculate median percentage of vaccine agree %s by region
  group_by(continent) |>
  mutate(
    median_percentage = median(percentage, na.rm = TRUE)
  ) |>
  ungroup() |>
  
  # only keep one row for each country (remove dupes)
  filter(response != "Somewhat agree") |> 
  select(country, region, percentage, median_percentage, continent) |>
  # ordering of region and country
  mutate(
    country = fct_reorder(country, percentage)
  )
  1. Improve the visualization above by either re-creating it with the issues you identified fixed OR by creating a new visualization that you believe tells the same story better.
Click to expand/collapse
# custom function to get continent sizes
n_fun <- function(x){
  return(data.frame(y = 0.995,
                    label = paste0(
                      "n = ", length(x))))
}

update_geom_defaults("text",
                   list(size = 2.7,
                        family = "sans"))

plot <- plot_df |>
  filter(continent != "Not Assigned") |>
  ggplot(mapping = aes(
    x = continent,
    y = percentage,
    fill = continent)) +
  geom_boxplot() +
  labs(
    title = "Percentage of People Who Believe Vaccines are Safe, by Continent",
    subtitle = "n = number of countries",
    x = "",
    y = ""
  ) +
  theme_bw() +
  theme(
    text = element_text(family = "sans"),
    legend.position = "none",
    plot.title = element_text(hjust = 1.13),
    plot.subtitle = element_text(hjust = -0.67,
                                 face = "italic"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_y_continuous(
    labels = scales::percent_format(scale = 100),
    breaks = seq(0, 1, by = 0.25),
    limits = c(0.24,1.2)
  ) +
  scale_fill_brewer(palette = 2, type = "qual") + 
  stat_summary(fun.data = n_fun, 
               geom = "text", 
               hjust = 0.27) +
  coord_flip()

ggsave(
  here::here("image", "improved-wellcome-graph.png"),
  plot = plot,
  width = 6,
  height = 4,
  dpi = 300
)

plot

Part Two: Broad Visualization Improvement

The full Wellcome Global Monitor 2018 report can be found here: https://wellcome.ac.uk/sites/default/files/wellcome-global-monitor-2018.pdf. Surprisingly, the visualization above does not appear in the report despite the citation in the bottom corner of the image!

Second Data Visualization Improvement

For this second plot, you must select a plot that uses maps so you can demonstrate your proficiency with the leaflet package!

  1. Select a data visualization in the report that you think could be improved. Be sure to cite both the page number and figure title. Do your best to tell the story of this graph in words. That is, what is this graph telling you? What do you think the authors meant to convey with it?

I selected chart 2.14, the “Map of interest in knowing more about medicine, disease or health by country” on page 39. This map shows the percentage of people that reported “yes” to the survey question “Would you, personally, like to know more about medicine, disease or health?”. The darker the color of the country, the “more interested” that country’s people are. I think the authors mean to convey countries’ “openness” to learning more about health and medicine and show where this openness is more concentrated on the globe.

  1. List the variables that appear to be displayed in this visualization.

The variables that appear on this map are:

  • percentage of people that replied yes to this survey question
  • country name
  1. Now that you’re versed in the grammar of graphics (ggplot), list the aesthetics used and which variables are specified for each.

The percentage variable is mapped to polygon fill color. Country name and their respective percentage is mapped to the polygon labels.

  1. What type of graph would you call this?

Looking up graph types on maps, this is a choropleth graph (https://www.esri.com/arcgis-blog/products/insights/analytics/data-visualization-types).

  1. List all of the problems or things you would improve about this graph.
  • I would improve the graph’s functionality. I know this is graph on a pdf, but I would want to be able to hover over a country and see the exact percentage and name of the country.
  • I would also add a legend to the map that shows the percentage of people that answered “yes” to the survey question.
  • I think I would use a different palette that has two colors on each end of the scale, rather than using different shades of green.
  • I think there could be more labeling or text on the map itself.
  1. Improve the visualization above by either re-creating it with the issues you identified fixed OR by creating a new visualization that you believe tells the same story better.
Click to expand/collapse
map_df <- wellcome_data |>
  janitor::clean_names() |> 
  select(
    country:response, column_n_percent_4
  ) |> 
  filter(
    response == "Yes",
    str_detect(question, pattern = "Q9")
  ) |> 
  rename(
    percentage_yes = column_n_percent_4
  ) |> 
  mutate(
    percentage_yes = 100*round(percentage_yes, 2)
  ) |> 
  select(country, percentage_yes) |> 
  mutate(iso_a3 = countrycode(country,
                            origin = "country.name",
                            destination = "iso3c"),
         iso_a3 = case_when(
           country == "Kosovo" ~ "XKX",
           TRUE ~ iso_a3
         )) 
Click to expand/collapse
map <- leaflet() |> 
  addTiles() 



# Get world map data from Natural Earth
world <- ne_countries(scale = "medium", # size
                      returnclass = "sf") # output object

# Merge Poll data with the world map data
world_data <- world |>
  left_join(map_df, by = c("adm0_a3" = "iso_a3"))

# Define color palette based on percentage
pal <- colorNumeric(
  palette = "YlOrRd", # Color palette
  domain = world_data$percentage_yes
)

# Create leaflet map
# add a plot label
map_plot <- world_data |> 
  leaflet() |> 
  addTiles() |> 
  addPolygons(
    fillColor = ~pal(percentage_yes),
    color = "black", # Border color
    weight = 1, # Border weight
    fillOpacity = 0.7,
    highlightOptions = highlightOptions(
      weight = 2,
      color = "white",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    # Tooltip label
    # country: %
    label = ifelse(
      is.na(world_data$percentage_yes),
      paste0('No data available.'),
      paste0(world_data$country,
              ": ",
              world_data$percentage_yes, "%")), 
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto"
    )
  ) |> 
  
  # Graph title
  addControl(
    html = "<div style='font-size: 16px; font-weight: bold; margin: 5px;'>
    Percentage of People that are interested in health, disease, or medicine</div>
    \n   (said 'Yes' on Question 9 in Gallup Poll 2018)",
    position = "bottomleft" # Adjust position as needed
  )

# Display the map
map_plot

Third Data Visualization Improvement

For this third plot, you must use one of the other ggplot2 extension packages mentioned this week (e.g., gganimate, plotly, patchwork, cowplot).

  1. Select a data visualization in the report that you think could be improved. Be sure to cite both the page number and figure title. Do your best to tell the story of this graph in words. That is, what is this graph telling you? What do you think the authors meant to convey with it?

I chose Chart 2.15 on page 40 titled, “Scatterplot exploring interest in science by those who have sought information.” I think this plot is trying to display the correlation between a country’s interest in science and the percentage of people that have sought information about health, disease, or medicine. The authors are trying to show that there is a positive correlation between these two variables. The more people are interested in science, the more likely they are to seek information about health, disease, or medicine.

  1. List the variables that appear to be displayed in this visualization.

The variables used in this plot are region, response to Question 25, and the percentage answered for some response level.

  1. Now that you’re versed in the grammar of graphics (ggplot), list the aesthetics used and which variables are specified for each.

The aesthetics used in this plot are: - the percentage of people that answered “yes” to survey question 8 “Would you, personally, like to know more about science?” is mapped to the x-axis. - the percentage of people that answered “yes” to survey question 7 “Have you, personally, tried to get any information about medicine, disease, or health in the past 30 days?” is mapped to the y-axis.

  1. What type of graph would you call this?

This is a scatterplot.

  1. List all of the problems or things you would improve about this graph.
  • I would map another variable to the color aesthetic: region.
  • I would add in a hover feature where you can see the country name and both of its corresponding percentages
  • I would make a point corresponding to the whole world as a region highlighted on the graph, so that viewers can compare countries’ opinions to the world as a whole.
  1. Improve the visualization above by either re-creating it with the issues you identified fixed OR by creating a new visualization that you believe tells the same story better.
Click to expand/collapse
# Create a data frame with the percentage of people that agree vaccines are safe worldwide
worldwide_q7_8 <- wellcome_data |>
  janitor::clean_names() |> 
  select(
    country:response, column_n_percent_4
  ) |> 
  # copy question to fill NAs
  fill(question) |> 
  filter(
    str_detect(question, pattern = "Q7") |
      str_detect(question, pattern = "Q8"),
    response == "Yes"
  ) |> 
  mutate(
    percentage = column_n_percent_4
  ) |> 
  group_by(question, response) |> 
  summarise(
    percentage = round(mean(percentage, na.rm = TRUE), 2),
    region = "World",
    country = "World"
  ) |>
  ungroup() |> 
  mutate(
    question = 
      case_when(
        str_detect(question, pattern = "Q8") ~ "Q8",
        str_detect(question, pattern = "Q7") ~ "Q7"
      
  )) |> 
  select(-response) |> 
  # pivot
  pivot_wider(
    names_from = question,
    values_from = percentage
  )


q7_8_data <- wellcome_data |>
  janitor::clean_names() |> 
  select(
    country:response, column_n_percent_4
  ) |> 
  fill(question) |> 
  filter(
    str_detect(question, pattern = "Q7") |
      str_detect(question, pattern = "Q8"),
    response == "Yes"
  ) |> 
  mutate(
    percentage = column_n_percent_4
  ) |> 
  select(country, question, percentage, response)  |> 
  left_join(country_region, by = "country") |> 
  mutate(
    region = ifelse(
      region == "Aus/NZ", "Australia/New Zealand", region),
    question = case_when(
      str_detect(question, pattern = "Q7") ~ "Q7",
      str_detect(question, pattern = "Q8") ~ "Q8"
    )
    ) |>
  group_by(country, region, question, response) |> 
  summarize(
    percentage = round(mean(percentage, na.rm = TRUE), 2)
  ) |> 
  ungroup() |> 
  select(-response) |> 
    # pivot
  pivot_wider(
    names_from = question,
    values_from = percentage
  )


# join with worldwide summary data
q7_8_data <- q7_8_data |> 
  bind_rows(worldwide_q7_8) 
Click to expand/collapse
medians <- q7_8_data |> 
  filter(region != "World") |>
  summarize(
    Q7 = median(Q7, na.rm = TRUE),
    Q8 = median(Q8, na.rm = TRUE)
  ) |> as_tibble();medians
# A tibble: 1 × 2
     Q7    Q8
  <dbl> <dbl>
1  0.43 0.715
Click to expand/collapse
# create a scatterplot comparing countries' interest in science and health, with q7 on the y axis
# write the code for the plot

# color palette
pal <- rainbow(length(unique(q7_8_data$region)), start = 0, end = 0.9)

plot <- q7_8_data |> 
  mutate(isWorld = ifelse(country == "World", TRUE, FALSE)) |>
  ggplot(mapping = aes(
    x = Q8,
    y = Q7,
    color = region,
    label = country,
    size = isWorld,
    shape = isWorld,
    text = ifelse(
      isWorld == FALSE,
      paste0("x:", Q8, "<br>y:", Q7, "<br>Country:", country, "<br>Region:",region),
      paste0("Worldwide<br>x:", Q8, "<br>y:", Q7)
    ))) +
  geom_jitter() +
  labs(
    title = "Interest in Science vs Interest in Health",
    subtitle = "Percentage of people that answered 'yes' to survey questions Q7 and Q8",
    x = "",
    y = ""
  ) +
  theme_bw() +
  theme(
    text = element_text(family = "sans"),
    legend.position = "none",
    plot.title = element_text(hjust = -0.129),
    plot.subtitle = element_text(hjust = -0.055,
                                 face = "italic"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_continuous(
    labels = scales::percent_format(scale = 100),
    breaks = seq(0, 1, by = 0.25),
    limits = c(0,1)
  ) +
  
  scale_y_continuous(
    labels = scales::percent_format(scale = 100),
    breaks = seq(0, 1, by = 0.25),
    limits = c(0,1)
  ) +
  
  scale_color_manual(values = pal) +
  coord_cartesian(xlim=c(0,1), ylim=c(0,1)) +
  # add median vertical and horizontal lines
  geom_hline(yintercept = medians$Q7, linetype = "dashed", color = "black") +
  geom_vline(xintercept = medians$Q8, linetype = "dashed", color = "black") +
  scale_shape_manual(values = c(15, 18)) +
  scale_size_manual(values = c(2, 4)) +
  annotate(
    "text", x = medians$Q8 + 0.09, y = 0, 
    label = paste0("Median: ", 100*medians$Q8, "%"), size = 4, color = "black"
    ) +
    annotate(
    "text", y = medians$Q7 + 0.05, x = 0.04, 
    label = paste0("Median: ", 100*medians$Q7,"%"), size = 4, color = "black"
    )


# Re-make Plot but implementing hover tooltips to show percentage and country name when you're on the point
plotly_plot <- plot |> 
  ggplotly(tooltip = c("text")) |> 
    (\(.) {
    .$x$layout$title$x <- 0 # Left justify title
    . # Return the modified plotly object
  })()


plotly_plot